Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4980] [MLlib] Add decay factors to streaming linear methods #8022

Closed
wants to merge 11 commits into from
Closed

[SPARK-4980] [MLlib] Add decay factors to streaming linear methods #8022

wants to merge 11 commits into from

Conversation

rotationsymmetry
Copy link
Contributor

This PR includes an implementation of decay factors in streaming linear and logistic regression. Unit tests are also included.

The algorithm and design details are described in the document: https://docs.google.com/document/d/1UfKvuaaJVQCvh-wOLLYT8l7STQFjPxE7fitZyd0tqTo/edit?usp=sharing

Your comments and suggestions are highly appreciated. I will add more tests and ScalaDoc as suggested.

Thanks!

cc @freeman-lab @mengxr

def getDiscount(numNewDataPoints: Long): Double
}

private[mllib] trait StreamingDecaySetter[T <: StreamingDecaySetter[T]] extends Logging {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need F-bounded polymorphism here? Does the code not work when you replace T with self.type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Does the code not work when you replace T with self.type?"

I guess not? For example,

trait Setter { def set: this.type = this}
class Apple extends Setter
val a = new Apple()
a.set

The return type of a.set is a.type, not Apple. Do I answer your question?

"Why do we need F-bounded polymorphism here?"
I agree with you that this is not needed here. Originally I included this as an extra level of type checking. But since I have self: T=> in the next line, I don't think we need it any more. I will remove it in the next push to the PR.

@feynmanliang
Copy link
Contributor

@rotationsymmetry you also have a merge conflict, sorry 😞 do you mind resolving?

@rotationsymmetry
Copy link
Contributor Author

@feynmanliang Thank you very much for your review.

I have incorporated your comments in commit a4ed2b0.

  • Add ScalaDoc for public API.
  • Add ScalaDoc to decribe the forgetful algorithm in StreamingLinearAlgorithm.
  • Remove F-polymorphism in StreamingDecaySetter[T].
  • decayFactor and timeUnit in StreamingDecaySetter[T] are now private.
  • Remove division by zero in trainOn of StreamingLinearAlgorithm; provide comments to explains why.
  • Improve testing cases of StreamingLogisticRegressionSuite to have rel tol=0.1.
  • resolve merge conflict.

As for your comment of "having getLambda instead of getDiscount in StreamingDecay", I feel that the discount factor better conveys the mathematical idea of the algorithm. Lambda, on the other hand, is only a temporary value in the calculation. For example, in the spark doc, the discount factor is employed to describe the algorithm. I have included similar description in the ScalaDoc for StreamingLinearAlgorithm.

Thanks again for your review. If you have any further comments, please let me know.

@@ -32,6 +32,11 @@ import org.apache.spark.mllib.regression.StreamingLinearAlgorithm
* of features must be constant. An initial weight
* vector must be provided.
*
* This class inherits the forgetful algorithm from StreamingLinearAlgorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"[[StreamingLinearAlgorithm]]" so API docs generate a link, ditto for L37

@rotationsymmetry
Copy link
Contributor Author

@feynmanliang I have make another push to the PR:

Refactor StreamingDecay
Use case object for TimeUnit (only for the regression/classification. no change to StreamingKMeans until rely of the author)
Clean up ScalaDoc
Add tests for half life and TimeUnit

Thank again for your review.

@@ -101,4 +107,14 @@ class StreamingLogisticRegressionWithSGD private[mllib] (
this.model = Some(algorithm.createModel(initialWeights, 0.0))
this
}

override def setDecayFactor(decayFactor: Double): this.type = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This boilerplate is duplicated in streaming linear regression. I am guessing you do this to get the concrete subclass (correct me if I'm wrong), but you actually don't need to do this since the this.type in trait StreamingDecay takes care of this. A simple REPL example:

scala> trait Superclass { def test: this.type }
defined trait Superclass

scala> class Subclass extends Superclass { def test = this }
defined class Subclass

scala> (new Subclass()).test
res0: Subclass = Subclass@1cb4ab3e

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I meant that you could remove these setters entirely

scala> trait Superclass { def test: this.type = this }
defined trait Superclass

scala> class Subclass extends Superclass
defined class Subclass

scala> (new Subclass).test
res1: Subclass = Subclass@b364520

@feynmanliang
Copy link
Contributor

Made another pass

@rotationsymmetry
Copy link
Contributor Author

@feynmanliang Thank you for your comments. I have revised the PR, including

  • Refactor: timeUnit has its own setter.
  • Add @SInCE.
  • Clean up ScalaDoc.

As I am rewriting the ScalaDoc, it appears that the algorithm can be more easily described and understood if we rename decayFactor to retentionFactor. What do you think?

@feynmanliang
Copy link
Contributor

Streaming KMeans uses decayFactor and I think it's important we maintain consistency

*/
@Experimental
private[mllib] trait StreamingDecay extends Logging{
private[this] var decayFactor: Double = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just private is fine

@feynmanliang
Copy link
Contributor

LGTM after these changes and pending tests

CC @mengxr @freeman-lab

@rotationsymmetry
Copy link
Contributor Author

@feynmanliang Much appreciated. I have update the PR for your comments.

@mengxr
Copy link
Contributor

mengxr commented Oct 20, 2015

add to whitelist

@mengxr
Copy link
Contributor

mengxr commented Oct 20, 2015

ok to test

@SparkQA
Copy link

SparkQA commented Oct 21, 2015

Test build #44017 has finished for PR 8022 at commit 9ba83cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Nov 3, 2015

@rotationsymmetry Could you provide a simple unit test in Java to show Java compatibility?

val lambda = numNewDataPoints / updatedDataWeight

BLAS.scal(lambda, newModel.weights)
BLAS.axpy(1-lambda, model.get.weights, newModel.weights)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have some references about this merging scheme? I assume that this works for many cases, but there is no guarantee in theory.

rotationsymmetry and others added 11 commits November 8, 2015 14:49
Revise test "parameter accuracy" in StreamingLinearRegressionSuite to account for decay.
Split StreamingDecay into two traits.
Update StreamingLogisticRegressionWithSGD.
Update test suites.
Also make StreamingDecaySetter to be private[mllib].
Add ScalaDoc for public API.
Add ScalaDoc to decribe the forgetful algorithm in StreamingLinearAlgorithm.
Remove F-polymorphism in StreamingDecaySetter[T].
decayFactor and timeUnit in StreamingDecaySetter[T] are now private.
Remove division by zero in trainOn of StreamingLinearAlgorithm; provide comments to explains why.
Improve testing cases of StreamingLogisticRegressionSuite to have rel tol=0.1.
Refactor StreamingDecay
Use case object for TimeUnit
Clean up ScalaDoc
clean up new lines and comments.
@SparkQA
Copy link

SparkQA commented Nov 9, 2015

Test build #45336 has finished for PR 8022 at commit 0072400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants